Artificial intelligence-aided method to detect uterine fibroids in ultrasound images: a retrospective study

We explored a new artificial intelligence (AI)-assisted method to help junior ultrasonographers improve their diagnostic performance for uterine fibroids and compared the assisted performance with that of senior ultrasonographers to confirm the method's effectiveness and feasibility. In this retrospective study, we collected a total of 3870 ultrasound images from 667 patients (mean age, 42.45 years ± 6.23 [SD]) with pathologically confirmed uterine fibroids and 570 women (mean age, 39.24 years ± 5.32 [SD]) without uterine lesions at Shunde Hospital of Southern Medical University between 2015 and 2020. A deep convolutional neural network (DCNN) model was trained and developed on the training dataset (2706 images) and internal validation dataset (676 images). To evaluate the model on the external validation dataset (488 images), we compared the diagnostic performance of ultrasonographers of different seniority with and without DCNN assistance. With the DCNN model's aid, the junior ultrasonographers (averaged) diagnosed uterine fibroids with higher accuracy (94.72% vs. 86.63%, P < 0.001), sensitivity (92.82% vs. 83.21%, P = 0.001), specificity (97.05% vs. 90.80%, P = 0.009), positive predictive value (97.45% vs. 91.68%, P = 0.007), and negative predictive value (91.73% vs. 81.61%, P = 0.001) than they achieved alone. Their assisted performance was comparable to that of the senior ultrasonographers (averaged) in accuracy (94.72% vs. 95.24%, P = 0.66), sensitivity (92.82% vs. 93.66%, P = 0.73), specificity (97.05% vs. 97.16%, P = 0.79), positive predictive value (97.45% vs. 97.57%, P = 0.77), and negative predictive value (91.73% vs. 92.63%, P = 0.75). The DCNN-assisted strategy can considerably improve junior ultrasonographers' performance in diagnosing uterine fibroids, making it comparable to that of senior ultrasonographers.


Model Development
As an anchor-based, one-stage detection model, YOLOv3 detects objects faster than two-stage detectors while sacrificing little accuracy. This is mainly due to the basic ideas behind its design.
YOLOv3 uses an end-to-end architecture that effectively utilizes global image information. In addition, its authors use a deeper feature extraction network called Darknet-53, which contains batch normalization, stacked residual blocks, and successive 3 × 3 and 1 × 1 convolutions; these choices improve convergence. Because the network directly predicts the probability of an object category together with the coordinates of its location, detection is relatively fast. However, YOLOv3 detects small targets relatively poorly.
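As an illustration of this design, the following is a minimal sketch (not the authors' released code) of a Darknet-53-style residual block in Paddle, the framework used in this study; the class and argument names are our own.

```python
import paddle.nn as nn

class DarknetResidualBlock(nn.Layer):
    """Darknet-53-style residual block: a 1 x 1 convolution halves the
    channels, a 3 x 3 convolution restores them, batch normalization
    follows each convolution, and a skip connection adds the input back."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2D(channels, channels // 2, kernel_size=1, bias_attr=False)
        self.bn1 = nn.BatchNorm2D(channels // 2)
        self.conv2 = nn.Conv2D(channels // 2, channels, kernel_size=3, padding=1, bias_attr=False)
        self.bn2 = nn.BatchNorm2D(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        y = self.act(self.bn1(self.conv1(x)))
        y = self.act(self.bn2(self.conv2(y)))
        return x + y  # the residual connection eases optimization of deep stacks
```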
In theory, the deeper a network is, the stronger its expressive power. In practice, however, once a CNN reaches a certain depth, further deepening no longer improves its classification performance; instead, convergence slows and accuracy decreases. The residual structure of ResNet effectively alleviates the gradient explosion and vanishing-gradient problems caused by network depth. To more accurately detect the relatively weak uterine fibroid targets in ultrasound images, we therefore replaced YOLOv3's Darknet-53 backbone with ResNet50. The ResNet50 network structure is shown schematically in Table S1.
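The backbone swap can be sketched as follows, assuming Paddle's built-in ResNet50 (layer names follow paddle.vision.models; this is not the authors' code). The three returned feature maps feed YOLOv3's multi-scale detection heads in place of Darknet-53's outputs.

```python
import paddle
from paddle.vision.models import resnet50

class ResNet50Backbone(paddle.nn.Layer):
    """Expose ResNet50's intermediate feature maps (commonly called
    C3/C4/C5) so they can replace Darknet-53 features in YOLOv3."""

    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet50(pretrained=pretrained)  # ImageNet-pretrained weights
        self.stem = paddle.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)  # stride 8
        c4 = self.layer3(c3)  # stride 16
        c5 = self.layer4(c4)  # stride 32
        return c3, c4, c5     # multi-scale features for the three YOLO heads
```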
To the best of our knowledge, no detection algorithm had previously been applied to uterine fibroids. We therefore compared our method with other well-known algorithms by training each of them on our dataset and comparing their results with ours. The detailed results are shown in Table S2.

Data Augmentation
Data augmentation was necessary to improve the generalizability and robustness of the developed network. We used the following methods to augment the data in our experiments:
1) The image was randomly scaled, and its pixel values were normalized to between -0.5 and 0.5; with a probability of 0.5, the hue was shifted by -18 to 18 and the saturation, brightness and contrast were each scaled by a factor of 0.5 to 1.5.
2) The image was randomly flipped horizontally and randomly distorted.
3) The image was randomly expanded with an execution probability of 0.5 and a maximum expansion ratio of 4; the fill color values for the expanded region were R: 123.675, G: 116.28, and B: 103.53 (this operation and horizontal flipping are sketched after this list).
4) The image was randomly cropped: the ratio of the length to the width of the cropped area was 0.5 to 2, the effective intersection over union (IoU) thresholds were (0, 0.1, 0.3, 0.5, 0.7, 0.9), and the ratio of the cropped area to the original image was 0.3 to 1.
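The sketch below is an illustrative re-implementation of two of these operations, horizontal flipping and expansion with the stated fill color, under our own function names; it is not the exact pipeline used in training.

```python
import random
import numpy as np

FILL_RGB = (123.675, 116.28, 103.53)  # fill color for the expanded region

def random_flip(img, boxes, prob=0.5):
    """img: float32 HxWx3 RGB array; boxes: Nx4 array of (x1, y1, x2, y2)."""
    if random.random() < prob:
        img = img[:, ::-1, :].copy()             # mirror the image horizontally
        w = img.shape[1]
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror the x-coordinates
    return img, boxes

def random_expand(img, boxes, prob=0.5, max_ratio=4.0):
    """Paste the image at a random offset inside a larger filled canvas."""
    if random.random() >= prob:
        return img, boxes
    h, w, c = img.shape
    ratio = random.uniform(1.0, max_ratio)
    oh, ow = int(h * ratio), int(w * ratio)
    top = random.randint(0, oh - h)
    left = random.randint(0, ow - w)
    canvas = np.full((oh, ow, c), FILL_RGB, dtype=np.float32)
    canvas[top:top + h, left:left + w, :] = img
    boxes = boxes + np.array([left, top, left, top], dtype=np.float32)
    return canvas, boxes
```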

Machine Environment
All training and testing procedures were developed with Paddle (version 2.0.2), CUDA (version 10.1) and Python (version 3.7). Four graphics processing units (GPUs; NVIDIA GeForce GTX 1080Ti) were used, and the total training time was 10 hours. The Adam optimizer was initialized, and each mini-batch contained 12 images. The weight decay was set as 0.0005, and the momentum was set as 0.9.
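The optimizer setup described above can be sketched as follows; `model` is a hypothetical placeholder for the detection network, and in Adam the reported momentum of 0.9 corresponds to the beta1 parameter.

```python
import paddle

# Hedged sketch of the stated training configuration (model is hypothetical).
optimizer = paddle.optimizer.Adam(
    learning_rate=0.0005,            # assumed initial learning rate
    beta1=0.9,                       # momentum term of Adam
    weight_decay=0.0005,             # L2 regularization strength
    parameters=model.parameters(),
)
```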
We added different tricks to the image preprocessing, the backbone structure and the regression branch of the loss function when training the model, and the results are substantially improved compared with the original YOLOv3 network. Nevertheless, many further improvements to YOLOv3 remain possible, such as optimizing the target confidence estimation and the feature extraction.

Additional Architecture for the Development of a Deep Learning Model
The loss function of YOLOv3 is mainly composed of three parts: the coordinate loss of the prediction box, the classification loss of the target and the confidence loss of the category.
We added different tricks to the regression branch of the loss function to train the model.
The model loss function consists of four component losses plus their weighted sum:

1) The loss of the predicted center coordinates is

$$L_{xy} = \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left( x_i - \hat{x}_i \right)^2 + \left( y_i - \hat{y}_i \right)^2 \right],$$

where $\lambda_{\mathrm{coord}}$ is a given constant that represents the weight of the loss, $(x_i, y_i)$ is the actual position obtained from the training data, $(\hat{x}_i, \hat{y}_i)$ is the position of the predicted bounding box, $S^2$ denotes the grid size, $B$ stands for the number of predicted boxes per grid cell, and $\mathbb{1}_{ij}^{\mathrm{obj}} = 1$ if the $j$-th box of grid cell $i$ contains a target and $0$ otherwise.

2) The loss of the width and height of the predicted bounding box is

$$L_{wh} = \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right],$$

where $w_i$ and $h_i$ indicate the width and height of the ground-truth (GT) box, respectively.

3) The loss of the predicted category is

$$L_{\mathrm{cls}} = \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 .$$

4) The loss based on the confidence of the prediction is

$$L_{\mathrm{conf}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2,$$

where $C_i$ is the confidence score, $\hat{C}_i$ is the intersection over union of the predicted bounding box with the GT box, and $\mathbb{1}_{ij}^{\mathrm{noobj}} = 1$ when there is no object in the grid cell and $0$ otherwise.

5) The total loss function is

$$L = L_{xy} + L_{wh} + L_{\mathrm{cls}} + L_{\mathrm{conf}},$$

where the $\lambda$ terms are given constants that weight the component losses: the highest penalty is placed on coordinate prediction ($\lambda_{\mathrm{coord}} = 5$), and the lowest penalty is placed on confidence prediction when no target is detected ($\lambda_{\mathrm{noobj}} = 0.5$).
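For concreteness, a dense NumPy computation of this loss for a single image might look as follows; the array shapes and names are our own illustrative choices, not the training code.

```python
import numpy as np

def yolo_loss(pred, gt, obj, cls_pred, cls_gt,
              lambda_coord=5.0, lambda_noobj=0.5):
    """pred, gt: (S*S, B, 5) arrays of (x, y, w, h, confidence), with
    widths and heights assumed nonnegative;
    obj: (S*S, B) 0/1 mask of boxes responsible for a target;
    cls_pred, cls_gt: (S*S, C) per-cell class probabilities."""
    noobj = 1.0 - obj
    l_xy = lambda_coord * np.sum(obj * ((pred[..., 0] - gt[..., 0]) ** 2
                                        + (pred[..., 1] - gt[..., 1]) ** 2))
    l_wh = lambda_coord * np.sum(obj * ((np.sqrt(pred[..., 2]) - np.sqrt(gt[..., 2])) ** 2
                                        + (np.sqrt(pred[..., 3]) - np.sqrt(gt[..., 3])) ** 2))
    l_conf = (np.sum(obj * (pred[..., 4] - gt[..., 4]) ** 2)
              + lambda_noobj * np.sum(noobj * (pred[..., 4] - gt[..., 4]) ** 2))
    cell_has_obj = obj.max(axis=1)  # cell-level object indicator
    l_cls = np.sum(cell_has_obj[:, None] * (cls_pred - cls_gt) ** 2)
    return l_xy + l_wh + l_cls + l_conf
```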
The parameters of the model were set as follows. The output layer and the fully connected layer of ResNet50 were modified according to our task: the last layer of the ResNet50 structure was replaced with a softmax layer, and the network was pretrained on ImageNet (a large-scale hierarchical image database) and then fine-tuned on our training set.
We assigned different weights to each class when computing the loss to compensate for the unbalanced datasets: higher weights are assigned to classes with fewer images, and lower weights are assigned to classes with more images. Batch normalization, learning rate decay and cross-validation were used to reduce the risk of model overfitting. After trying different configurations, we obtained the best results with a batch size of 12, and stochastic gradient descent (SGD) was adopted to update the weights of the network. The IoU threshold was set to 0.5, and the confidence threshold was set to 0.5. The detection performance of each model was tested on 268 images containing uterine fibroids. The initial learning rate was 0.0005, and each minibatch contained four images. The average precision (AP) was obtained by integrating the precision-recall curve; because this task involves single-target detection, AP equals mAP.
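Class weighting of this kind can be sketched as follows; the image counts below are hypothetical placeholders, not our dataset's actual counts.

```python
import paddle

n_per_class = paddle.to_tensor([2000.0, 700.0])    # hypothetical images per class
weights = n_per_class.sum() / (2.0 * n_per_class)  # fewer images -> larger weight
criterion = paddle.nn.CrossEntropyLoss(weight=weights)
```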

Model Fine-tuning
Fine-tuning is performed according to the following steps (a code sketch follows the list):
1. Pretrain a neural network model with a 1000-category softmax layer on the source dataset (the ImageNet dataset); this is the source model.

2. Create a new neural network model that replicates all of the model designs and parameters of the source model except the output layer; this is the target model.
3. Because our goal is binary classification, the new softmax layer of the target model consists of 2 categories instead of 1000, and its parameters are randomly initialized.
4. Train the target model on the dataset of ultrasound images we collected. The output layer is trained from scratch, while the parameters of the remaining layers are fine-tuned from the parameters of the source model.
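A minimal sketch of these four steps, assuming Paddle's built-in ResNet50 (layer names follow paddle.vision.models and may differ in other implementations):

```python
import paddle
from paddle.vision.models import resnet50

model = resnet50(pretrained=True)      # steps 1-2: ImageNet-pretrained source model
model.fc = paddle.nn.Linear(2048, 2)   # step 3: new 2-class head, randomly initialized
                                       # (softmax is applied by the loss at training time)

# Step 4: all parameters are optimized; the new head learns from scratch,
# while the reused layers start from the ImageNet weights.
optimizer = paddle.optimizer.Momentum(
    learning_rate=0.0005, momentum=0.9,
    weight_decay=0.0005, parameters=model.parameters())
```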

Evaluation Indicators for the Comparison Experiments Between the Models
For the uterine fibroid detection model in our study, thresholds are used as limiting metrics in several settings; chief among these is the IoU threshold. IoU measures the degree of overlap between two detection boxes (for target detection). The method of evaluating IoU is shown in Fig. S1.
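Computationally, IoU is the area of the intersection of two boxes divided by the area of their union; a minimal reference implementation (our own, for illustration) is:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```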

Results
The experimental results show that the recall and precision of the YOLOv3+ model are 92.91% and 95.04%, respectively. Although the Faster RCNN model performs well in most detection tasks, it performs poorly in this task; the number of DRs suggests that Faster RCNN is more suitable for multitarget detection tasks. In addition, the F1 score of the proposed model is 93.96%, which is substantially better than that of the second-best detection model, YOLOv3 (F1 score of 87.48%), a model commonly used for target detection in the medical field. The mAP (92.77%) of the proposed model is much higher than that of the other models, which indicates that the model is more sensitive to fibroids and can detect uterine fibroids more accurately.
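As a consistency check, the reported F1 score follows directly from the precision and recall above: F1 = 2PR / (P + R) = 2 × 0.9504 × 0.9291 / (0.9504 + 0.9291) ≈ 0.9396, i.e., 93.96%.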
Although the YOLOv3+ model is slightly weaker in computing speed, with a test time of 162 ms on a single uterine ultrasound image versus the single-stage detection models SSD (73 ms) and YOLOv3 (130 ms), it is still far faster than the average reading speed of professional doctors. Clearly, the unique residual structure of the model is effective for detecting uterine fibroids. Note: ResNet50 is used as the feature extractor in this manuscript. Convs = convolution layers.